Abstract

Submodular function minimization is a fundamental optimization problem that arises in several
applications in machine learning and computer vision. The problem is known to be solvable in polynomial
time, but general-purpose algorithms have high running times and are unsuitable for large-scale problems.
Recent work has used convex optimization techniques to obtain very practical algorithms for minimizing
functions that are sums of "simple" functions. In this paper, we use random coordinate descent methods
to obtain algorithms with faster linear convergence rates and cheaper iteration costs. Compared to
alternating projection methods, our algorithms do not rely on full-dimensional vector operations and they
converge in significantly fewer iterations.


1 Introduction 


Over the past few decades, there has been significant progress on minimizing submodular functions, leading
to several polynomial-time algorithms for the problem. Despite this intense focus, the running
times of these algorithms are high-order polynomials in the size of the data, and designing faster algorithms
remains a central and challenging direction in submodular optimization.

At the same time, technological advances have made it possible to capture and store data at an ever increasing
rate and level of detail. A natural consequence of this "big data" phenomenon is that machine learning
applications need to cope with data that is quite large and growing at a fast pace. Thus there is an
increasing need for algorithms that are fast and scalable.

The general purpose algorithms for submodular minimization are designed to provide worst-case guarantees 
even in settings where the only structure that one can exploit is submodularity. At the other extreme, 
graph cut algorithms are very efficient but they cannot handle more general submodular functions. In many 
applications, the functions strike a middle ground between these two extremes and it is becoming increasingly
important to use their special structure to obtain significantly faster algorithms.

Following prior work, we consider the problem of minimizing decomposable submodular functions that can
be expressed as a sum of simple functions. We use the term simple to refer to functions $F$ for which there
is an efficient algorithm for minimizing $F + w$, where $w$ is a linear function. We assume that we are given
black-box access to these minimization procedures for simple functions.

Decomposable functions are a fairly rich class of functions and they arise in several applications in machine 
learning and computer vision. For example, they model higher-order potential functions for MAP inference 
in Markov random fields, the cost functions in SVM models for which the examples have only a small number 
of features, and the graph and hypergraph cut functions in image segmentation.

Recent work has developed several algorithms with very good empirical performance that
exploit the special structure of decomposable functions. In particular, [6] have shown that the problem
of minimizing decomposable submodular functions can be formulated as a distance minimization problem
between two polytopes. This formulation, when coupled with powerful convex optimization techniques such
as gradient descent or projection methods, yields algorithms that are very fast in practice and very simple
to implement [6].

On the theoretical side, the convergence behaviour of these methods is not very well understood. Very
recently, Nishihara et al. [12] have made significant progress in this direction. Their work shows that the
classical alternating projections method, when applied to the distance minimization formulation, converges
at a linear rate.

Our contributions. In this work, we use random coordinate descent methods in order to obtain algorithms 
for minimizing decomposable submodular functions with faster convergence rates and cheaper iteration costs. 
We analyze a standard and an accelerated random coordinate descent algorithm and we show that they 
achieve linear convergence rates. Compared to alternating projection methods, our algorithms do not rely 
on full-dimensional vector operations and they are faster by a factor equal to the number of simple functions. 
Moreover, our accelerated algorithm converges in a much smaller number of iterations. We experimentally 
evaluate our algorithms on image segmentation tasks and we show that they perform very well and they 
converge much faster than the alternating projection method. 

Submodular minimization. The first polynomial-time algorithm for submodular optimization was
obtained by Grötschel et al. [3] using the ellipsoid method. There are several combinatorial algorithms for the
problem. Among the combinatorial methods, Orlin's algorithm achieves the best time
complexity of $O(n^5 T + n^6)$, where $n$ is the size of the ground set and $T$ is the maximum amount of time it takes
to evaluate the function. Several algorithms have been proposed for minimizing decomposable submodular
functions. Stobbe and Krause use gradient descent methods with sublinear convergence
rates for minimizing sums of concave functions applied to linear functions. Nishihara et al. [12] give an
algorithm based on alternating projections that achieves a linear convergence rate.

1.1 Preliminaries and Background 

Let $V$ be a finite ground set of size $n$; without loss of generality, $V = \{1, 2, \dots, n\}$. We view each point
$w \in \mathbb{R}^n$ as a modular set function $w(A) = \sum_{i \in A} w_i$ on the ground set $V$.

A set function $F : 2^V \to \mathbb{R}$ is submodular if $F(A) + F(B) \ge F(A \cap B) + F(A \cup B)$ for any two sets $A, B \subseteq V$.
A set function $F_i : 2^V \to \mathbb{R}$ is simple if there is a fast subroutine for minimizing $F_i + w$ for any modular
function $w \in \mathbb{R}^n$.
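The submodularity inequality can be checked by brute force on very small ground sets. The following is a minimal sketch, not part of the paper: the helper names (`all_subsets`, `is_submodular`, `cut_function`) and the two toy functions are illustrative assumptions, and the ground set is indexed from 0 rather than 1.

```python
from itertools import combinations

def all_subsets(n):
    """All subsets of {0, ..., n-1} as frozensets."""
    return [frozenset(c) for k in range(n + 1) for c in combinations(range(n), k)]

def is_submodular(F, n, tol=1e-12):
    """Brute-force check of F(A) + F(B) >= F(A & B) + F(A | B) for all A, B."""
    subsets = all_subsets(n)
    return all(F(A) + F(B) + tol >= F(A & B) + F(A | B)
               for A in subsets for B in subsets)

def cut_function(edges):
    """Undirected cut: total weight of edges with exactly one endpoint in A."""
    def F(A):
        return sum(w for (u, v, w) in edges if (u in A) != (v in A))
    return F

# Hypothetical toy instances (not from the paper).
F_cut = cut_function([(0, 1, 1.0), (1, 2, 1.0)])   # path 0 - 1 - 2
F_square = lambda A: float(len(A) ** 2)            # supermodular, so not submodular

print(is_submodular(F_cut, 3), is_submodular(F_square, 3))
```

The check enumerates all pairs of subsets, so it is only meant for sanity-testing tiny instances.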

In this paper, we consider the problem of minimizing a submodular function $F : 2^V \to \mathbb{R}$ of the form
$F = \sum_{i=1}^r F_i$, where each function $F_i$ is a simple submodular set function:

$$\min_{A \subseteq V} F(A) = \min_{A \subseteq V} \sum_{i=1}^r F_i(A) \qquad \text{(DSM)}$$

We assume without loss of generality that the function $F$ is normalized, i.e., $F(\emptyset) = 0$. Additionally,
we assume we are given black-box access to oracles for minimizing $F_i + w$ for each function $F_i$ in the
decomposition and each $w \in \mathbb{R}^n$.

The base polytope $B(F)$ of $F$ is defined as follows.

$$B(F) = \{ w \in \mathbb{R}^n \mid w(A) \le F(A) \text{ for all } A \subseteq V,\ w(V) = F(V) \}$$

The discrete problem (DSM)¹ admits an exact convex programming relaxation based on the Lovász extension
of a submodular function. The Lovász extension $f$ of $F$ can be written as the support function of the base
polytope $B(F)$:

$$f(x) = \max_{w \in B(F)} \langle w, x \rangle \quad \forall x \in \mathbb{R}^n$$

Even though the base polytope $B(F)$ has exponentially many vertices, the Lovász extension $f$ can be
evaluated efficiently using the greedy algorithm of Edmonds (see for example [TS]). Given any point $x \in \mathbb{R}^n$,
Edmonds' algorithm evaluates $f(x)$ using $O(n \log n) \times T$ time, where $T$ is the time needed to evaluate the
submodular function $F$.
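The greedy evaluation can be sketched as follows: sort the elements by decreasing value of $x$ and charge each element its marginal gain along the resulting chain of sets. This is an illustrative sketch rather than the paper's code; the function name `lovasz_extension` and the toy function `F_toy` (a modular term plus a path cut on three elements, indexed from 0) are assumptions made for the example.

```python
def lovasz_extension(F, x):
    """Evaluate the Lovasz extension f(x) with Edmonds' greedy algorithm.

    Elements are sorted by decreasing x-value; the marginal gains along the
    resulting chain form a point w of B(F) that maximizes <w, x>.
    Returns (f(x), w)."""
    n = len(x)
    order = sorted(range(n), key=lambda v: -x[v])
    w = [0.0] * n
    prefix = set()
    prev = F(frozenset())            # equals 0 for a normalized F
    for v in order:
        prefix.add(v)
        cur = F(frozenset(prefix))
        w[v] = cur - prev            # marginal gain of adding v
        prev = cur
    return sum(w[v] * x[v] for v in range(n)), w

def F_toy(A):
    """Hypothetical toy function: modular costs plus the cut of the path 0 - 1 - 2."""
    c = [-2.0, 1.0, 0.5]
    edges = [(0, 1, 1.0), (1, 2, 1.0)]
    return sum(c[v] for v in A) + sum(wt for (u, v, wt) in edges if (u in A) != (v in A))

value, w = lovasz_extension(F_toy, [1.0, -0.25, -0.25])
```

On an indicator vector of a set $A$, the routine returns $F(A)$, which is a quick way to sanity-check an implementation.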

Lovász showed that a set function $F$ is submodular if and only if its Lovász extension $f$ is convex [5]. Thus
we can relax the problem of minimizing $F$ to the following non-smooth convex optimization problem:

$$\min_{x \in [0,1]^n} f(x) = \min_{x \in [0,1]^n} \sum_{i=1}^r f_i(x)$$

where $f_i$ is the Lovász extension of $F_i$.

The relaxation above is exact. Given a fractional solution $x$ to the Lovász relaxation, the best threshold set
of $x$ has cost at most $f(x)$.

An important drawback of the Lovász relaxation is that its objective function is not smooth. Following
previous work, we consider a proximal version of the problem ($\|\cdot\|$ denotes the $\ell_2$-norm):

$$\min_{x \in \mathbb{R}^n} \left( f(x) + \frac{1}{2} \|x\|^2 \right) = \min_{x \in \mathbb{R}^n} \sum_{i=1}^r \left( f_i(x) + \frac{1}{2r} \|x\|^2 \right) \qquad \text{(Proximal)}$$

Given an optimal solution $x$ to the proximal problem $\min_{x \in \mathbb{R}^n} \left( f(x) + \frac{1}{2} \|x\|^2 \right)$, we can construct an optimal
solution to the discrete problem (DSM) by thresholding $x$ at zero; more precisely, the set $\{v \in V : x(v) > 0\}$
is an optimal solution to (DSM) (Proposition 8.6 in [1]).

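Lemma 1 ([6]). The dual of the proximal problem

$$\min_{x \in \mathbb{R}^n} \sum_{i=1}^r \left( f_i(x) + \frac{1}{2r} \|x\|^2 \right)$$

is the problem

$$\max_{y^{(1)} \in B(F_1), \dots, y^{(r)} \in B(F_r)} -\frac{1}{2} \left\| \sum_{i=1}^r y^{(i)} \right\|^2$$

The primal and dual variables are linked as $x = -\sum_{i=1}^r y^{(i)}$.

Lemma 1 was proved in [6]; we include a proof in Section A for completeness.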

1 DSM stands for decomposable submodular function minimization. 
RCDM Algorithm for (Prox-DSM)

⟨⟨We can take the initial point $y_0$ to be 0⟩⟩
Start with $y_0 = (y_0^{(1)}, \dots, y_0^{(r)}) \in \mathcal{Y}$
In each iteration $k$ ($k \ge 0$)
    Pick an index $i_k \in \{1, 2, \dots, r\}$ uniformly at random
    ⟨⟨Update the block $i_k$⟩⟩
    $y_{k+1}^{(i_k)} \leftarrow \arg\min_{y \in B(F_{i_k})} \left( \langle \nabla_{i_k} g(y_k), y - y_k^{(i_k)} \rangle + \frac{L_{i_k}}{2} \| y - y_k^{(i_k)} \|^2 \right)$

Figure 1: Random block coordinate descent method for (Prox-DSM). It finds a solution to (Prox-DSM)
given access to an oracle for $\min_{y \in B(F_i)} \left( \langle y, a \rangle + \|y\|^2 \right)$.


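We write the dual proximal problem in the following equivalent form:

$$\min_{y^{(1)} \in B(F_1), \dots, y^{(r)} \in B(F_r)} \left\| \sum_{i=1}^r y^{(i)} \right\|^2 \qquad \text{(Prox-DSM)}$$

It follows from the discussion above that, given an optimal solution $y = (y^{(1)}, \dots, y^{(r)})$ to (Prox-DSM), we
can recover an optimal solution to (DSM) by thresholding $x = -\sum_{i=1}^r y^{(i)}$ at zero.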
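The recovery step is a one-liner once the dual blocks are available. The sketch below is illustrative only: `threshold_at_zero`, `brute_force_min`, the toy function `F_toy`, and the hand-computed dual blocks are assumptions made for this example (three elements indexed from 0, one modular term and two unit-weight edges), not objects defined in the paper.

```python
from itertools import combinations

def threshold_at_zero(blocks):
    """Form x = -sum_i y^(i) and return ({v : x_v > 0}, x)."""
    n = len(blocks[0])
    x = [-sum(y[v] for y in blocks) for v in range(n)]
    return frozenset(v for v in range(n) if x[v] > 0), x

def brute_force_min(F, n):
    """Minimum of F over all subsets of {0, ..., n-1} (tiny n only)."""
    return min(F(frozenset(c)) for k in range(n + 1) for c in combinations(range(n), k))

def F_toy(A):
    """Hypothetical toy function: modular costs plus the cut of the path 0 - 1 - 2."""
    c = [-2.0, 1.0, 0.5]
    edges = [(0, 1, 1.0), (1, 2, 1.0)]
    return sum(c[v] for v in A) + sum(wt for (u, v, wt) in edges if (u in A) != (v in A))

# Dual blocks worked out by hand for F_toy: the modular block is the cost
# vector itself, and each edge block has the form (t, -t) on its endpoints.
blocks = [[-2.0, 1.0, 0.5], [1.0, -1.0, 0.0], [0.0, 0.25, -0.25]]
S, x = threshold_at_zero(blocks)
```

For this toy instance the thresholded set attains the brute-force minimum, which is the behaviour the statement above predicts for an optimal dual solution.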


2 Random Coordinate Descent Algorithm 

In this section, we give an algorithm for the problem (Prox-DSM) that is based on the random coordinate
gradient descent method (RCDM) of [10]. The algorithm is given in Figure 1. The algorithm is very easy to
implement and it uses oracles for problems of the form $\min_{y \in B(F_i)} \left( \langle y, a \rangle + \|y\|^2 \right)$, where $i \in [r]$ and $a \in \mathbb{R}^n$.
Since each function $F_i$ is simple, we have such oracles that are very efficient.
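A minimal sketch of the RCDM iteration on a toy instance follows. It is not the authors' implementation. It assumes Euclidean projection oracles (since $\langle y, a \rangle + \|y\|^2 = \|y + a/2\|^2 - \|a\|^2/4$, the oracle above is the projection of $-a/2$), an objective $g(y) = \|\sum_i y^{(i)}\|^2$ with $L_i = 2$, and a hypothetical decomposition into one modular function and two single-edge cut functions; all function names are illustrative.

```python
import random

def project_point(c):
    """Oracle for a modular function F(A) = c(A): B(F) is the single point c."""
    return lambda a: list(c)

def project_edge(u, v, w, n):
    """Oracle for the cut function of one undirected edge {u, v} of weight w.
    B(F) = {y : y_u = -y_v, |y_u| <= w, y = 0 elsewhere}, a line segment."""
    def proj(a):
        y = [0.0] * n
        y[u] = max(-w, min(w, (a[u] - a[v]) / 2.0))
        y[v] = -y[u]
        return y
    return proj

def rcdm(oracles, n, iters, seed=0):
    """Random block coordinate descent for min ||sum_i y^(i)||^2, y^(i) in B(F_i)."""
    rng = random.Random(seed)
    y = [proj([0.0] * n) for proj in oracles]           # a feasible starting point
    total = [sum(b[v] for b in y) for v in range(n)]    # running sum of the blocks
    for _ in range(iters):
        i = rng.randrange(len(oracles))
        # With L_i = 2 the block update is the projection of y^(i) - grad_i/2,
        # and grad_i g(y) = 2 * total, so the point to project is y^(i) - total.
        target = [y[i][v] - total[v] for v in range(n)]
        new = oracles[i](target)
        total = [total[v] + new[v] - y[i][v] for v in range(n)]
        y[i] = new
    return y, total

n = 3
oracles = [project_point([-2.0, 1.0, 0.5]),   # hypothetical unary costs
           project_edge(0, 1, 1.0, n),
           project_edge(1, 2, 1.0, n)]
y, total = rcdm(oracles, n, iters=200)
g_value = sum(t * t for t in total)
minimizer = {v for v in range(n) if -total[v] > 0}
```

Only the selected block and the running sum are touched in each iteration, which reflects the point that the method avoids full-dimensional ($nr$) vector operations.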

In the remainder of this section, we analyze the convergence rate of the RCDM algorithm. We emphasize
that the objective function of (Prox-DSM) is not strongly convex and thus we cannot use as a black-box
Nesterov's analysis of the RCDM method for minimizing strongly convex functions. Instead, we exploit the
special structure of the problem to achieve convergence guarantees that match the rate achievable for strongly
convex objectives with strong convexity parameter $1/(n^2 r)$. Our analysis shows that the RCDM algorithm
is faster by a factor of $r$ than the alternating projections algorithm from [12].

Outline of the analysis: Our analysis has two main components. First, we build on the work of [12] in
order to prove a key theorem (Theorem 2). This theorem exploits the special structure of the (Prox-DSM)
problem and it allows us to overcome the fact that the objective function of (Prox-DSM) is not strongly
convex. Second, we modify Nesterov's analysis of the RCDM algorithm for minimizing strongly convex
functions and we replace the strong convexity guarantee by the guarantee given by Theorem 2.

We start by introducing some notation; for the most part, we follow the notation of these two lines of work. Let
$\mathbb{R}^{nr} = \bigotimes_{i=1}^r \mathbb{R}^n$. We write a vector $y \in \mathbb{R}^{nr}$ as $y = (y^{(1)}, \dots, y^{(r)})$, where each block $y^{(i)}$ is an $n$-dimensional
vector. Let $\mathcal{Y} = \bigotimes_{i=1}^r B(F_i)$ be the constraint set of (Prox-DSM). Let $g : \mathbb{R}^{nr} \to \mathbb{R}$ be the objective function
of (Prox-DSM): $g(y) = \left\| \sum_{i=1}^r y^{(i)} \right\|^2$. We use $\nabla g$ to denote the gradient of $g$, i.e., the $(nr)$-dimensional vector
of partial derivatives. For each $i \in \{1, \dots, r\}$, we use $\nabla_i g(y) \in \mathbb{R}^n$ to denote the $i$-th block of coordinates of
$\nabla g(y)$.
Let $S \in \mathbb{R}^{n \times nr}$ be the following matrix:

$$S = \frac{1}{\sqrt{r}} \left[ I_n \ I_n \ \cdots \ I_n \right]$$

Note that $g(y) = r \|Sy\|^2$ and $\nabla g(y) = 2 r S^T S y$. Additionally, for each $i \in \{1, 2, \dots, r\}$, $\nabla_i g$ is Lipschitz
continuous with constant $L_i = 2$:

$$\| \nabla_i g(x) - \nabla_i g(y) \| \le L_i \| x^{(i)} - y^{(i)} \| \qquad (1)$$

for all vectors $x, y \in \mathbb{R}^{nr}$ that differ only in block $i$.
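Both identities are easy to confirm numerically. The sketch below is only a sanity check under the definitions above; the helper names and the random test vectors are assumptions made for the example. It verifies that $g(y) = r\|Sy\|^2$ and that the block gradient changes by exactly $2\|x^{(i)} - y^{(i)}\|$ when only block $i$ changes.

```python
import math
import random

def g(y):
    """g(y) = ||sum_i y^(i)||^2 for a list of r blocks, each a list of length n."""
    n = len(y[0])
    return sum(sum(b[v] for b in y) ** 2 for v in range(n))

def grad_block(y):
    """Block gradient: grad_i g(y) = 2 * sum_j y^(j), identical for every block i."""
    n = len(y[0])
    return [2.0 * sum(b[v] for b in y) for v in range(n)]

def norm(a):
    return math.sqrt(sum(t * t for t in a))

rng = random.Random(0)
r, n, i = 4, 3, 2
x = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(r)]
y = [list(b) for b in x]
y[i] = [rng.uniform(-1, 1) for _ in range(n)]    # x and y differ only in block i

# S y for S = r^{-1/2} [I_n ... I_n]
Sy = [sum(b[v] for b in y) / math.sqrt(r) for v in range(n)]
r_norm_sq = r * sum(t * t for t in Sy)

lhs = norm([p - q for p, q in zip(grad_block(x), grad_block(y))])
rhs = 2.0 * norm([p - q for p, q in zip(x[i], y[i])])
```

Here inequality (1) holds with equality, because $g$ restricted to a single block is a quadratic with Hessian $2 I_n$.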

Our first step is to prove the following key theorem that builds on the work of [12].

Theorem 2. Let $y \in \mathcal{Y}$ be a feasible solution to (Prox-DSM). Let $y^*$ be an optimal solution to (Prox-DSM)
that minimizes $\|y - y^*\|$. We have

$$\| S(y - y^*) \| \ge \frac{1}{nr} \| y - y^* \|.$$

The proof of Theorem 2 uses the following key result from [12]. We will need the following definitions from
[12].

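Let $d(K_1, K_2) = \inf \{ \|k_1 - k_2\| : k_1 \in K_1, k_2 \in K_2 \}$ be the distance between sets $K_1$ and $K_2$. Let $\mathcal{P}$ and $\mathcal{Q}$
be two closed convex sets in $\mathbb{R}^d$. Let $E \subseteq \mathcal{P}$ and $H \subseteq \mathcal{Q}$ be the sets of closest points

$$E = \{ p \in \mathcal{P} : d(p, \mathcal{Q}) = d(\mathcal{P}, \mathcal{Q}) \}$$
$$H = \{ q \in \mathcal{Q} : d(q, \mathcal{P}) = d(\mathcal{P}, \mathcal{Q}) \}$$

Since $\mathcal{P}$ and $\mathcal{Q}$ are convex, for each point $p \in E$, there is a unique point $q \in H$ such that $d(p, q) = d(\mathcal{P}, \mathcal{Q})$
and vice versa. Let $v = \Pi_{\mathcal{Q} - \mathcal{P}}(0)$; note that $H = E + v$. Let $\mathcal{Q}' = \mathcal{Q} - v$; $\mathcal{Q}'$ is a translated version of $\mathcal{Q}$ and
it intersects $\mathcal{P}$ at $E$. Let

$$\kappa_* = \sup_{x \in (\mathcal{P} \cup \mathcal{Q}') \setminus E} \frac{d(x, E)}{\max \{ d(x, \mathcal{P}), d(x, \mathcal{Q}') \}}.$$

By combining Corollary 5 and Proposition 11 from [12], we obtain the following theorem.

Theorem 3 ([12]). If $\mathcal{P}$ is the polyhedron $\bigotimes_{i=1}^r B(F_i)$ and $\mathcal{Q}$ is the polyhedron $\{ y \in \mathbb{R}^{nr} : \sum_{i=1}^r y^{(i)} = 0 \}$,
we have $\kappa_* \le nr$.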


Now we are ready to prove Theorem 2. Let

$$\mathcal{P} = \bigotimes_{i=1}^r B(F_i) = \mathcal{Y}$$
$$\mathcal{Q} = \left\{ y \in \mathbb{R}^{nr} : \sum_{i=1}^r y^{(i)} = 0 \right\} = \{ y \in \mathbb{R}^{nr} : Sy = 0 \}$$

We define $\mathcal{Q}'$ and $\kappa_*$ as above.

Let $y$ and $y^*$ be the two points in the statement of the theorem. Note that $y \in \mathcal{P}$ and $y^* \in E$, since $E$ is the
set of all optimal solutions to (Prox-DSM) (see Proposition 10 in Section B for a proof). We may assume
that $y \notin E$, since otherwise the theorem trivially holds. Since $y \in \mathcal{P} \setminus E$, we have

$$\kappa_* \ge \frac{d(y, E)}{d(y, \mathcal{Q}')}$$

Since $y^*$ is an optimal solution that is closest to $y$, we have $d(y, E) = \|y - y^*\|$. Using the fact that the
rows of $S$ form a basis for the orthogonal complement of $\mathcal{Q}$, we can show that $d(y, \mathcal{Q}') = \|S(y - y^*)\|$ (see
Proposition 11 in Section B for a proof). Therefore

$$\kappa_* \ge \frac{\|y - y^*\|}{\|S(y - y^*)\|}$$

Theorem 2 now follows from Theorem 3.

In the remainder of this section, we use Nesterov's analysis [10] in conjunction with Theorem 2 in order to
show that the RCDM algorithm converges at a linear rate. Recall that $E$ is the set of all optimal solutions
to (Prox-DSM).

Theorem 4. After $(k + 1)$ iterations of the RCDM algorithm, we have

$$\mathbb{E} \left[ d(y_{k+1}, E)^2 + g(y_{k+1}) - g(y^*) \right] \le \left( 1 - \frac{2}{n^2 r^2 + r} \right)^{k+1} \left( d(y_0, E)^2 + g(y_0) - g(y^*) \right),$$

where $y^* = \arg\min_{y \in E} \|y - y_k\|$ is the optimal solution that is closest to $y_k$.
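To get a feel for the rate, one can compute how many iterations the bound needs before the contraction factor $(1 - 2/(n^2 r^2 + r))^k$ drops below a target $\varepsilon$. The helper below is an illustrative calculation only; the function name and the sample values of $n$, $r$, and $\varepsilon$ are assumptions, and the result is a worst-case count implied by the bound, not a prediction of observed behaviour.

```python
import math

def rcdm_iterations(n, r, eps):
    """Smallest k with (1 - 2 / (n^2 r^2 + r))^k <= eps.

    Assumes n * r > 1 (so the rate lies strictly between 0 and 1) and 0 < eps < 1."""
    if n * r <= 1 or not (0.0 < eps < 1.0):
        raise ValueError("need n * r > 1 and 0 < eps < 1")
    rate = 1.0 - 2.0 / (n * n * r * r + r)
    return math.ceil(math.log(eps) / math.log(rate))

# Hypothetical sizes: n = 3 elements, r = 3 simple functions, halve the potential.
k_half = rcdm_iterations(3, 3, 0.5)
```

Since $\log(1 - t) \approx -t$ for small $t$, the count grows roughly like $\frac{1}{2}(n^2 r^2 + r) \log(1/\varepsilon)$.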


We devote the rest of this section to the proof of Theorem 4. We recall the following well-known lemma,
which we refer to as the first-order optimality condition.

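Lemma 5 (Theorem 2.2.5 in [11]). Let $f : \mathbb{R}^d \to \mathbb{R}$ be a differentiable convex function and let $Q \subseteq \mathbb{R}^d$ be a
closed convex set. A point $x^* \in Q$ is a solution to the problem $\min_{x \in Q} f(x)$ if and only if

$$\langle \nabla f(x^*), x - x^* \rangle \ge 0$$

for all $x \in Q$.

It follows from the first-order optimality condition for $y_{k+1}^{(i_k)}$ that, for any $z \in B(F_{i_k})$,

$$\left\langle \nabla_{i_k} g(y_k) + L_{i_k} \left( y_{k+1}^{(i_k)} - y_k^{(i_k)} \right), z - y_{k+1}^{(i_k)} \right\rangle \ge 0 \qquad (2)$$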

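We have

$$\begin{aligned}
g(y_{k+1}) &= g(y_k) + \int_0^1 \langle y_{k+1} - y_k, \nabla g(y_k + t(y_{k+1} - y_k)) \rangle \, dt \\
&= g(y_k) + \langle \nabla g(y_k), y_{k+1} - y_k \rangle + \int_0^1 \langle y_{k+1} - y_k, \nabla g(y_k + t(y_{k+1} - y_k)) - \nabla g(y_k) \rangle \, dt \\
&= g(y_k) + \left\langle \nabla_{i_k} g(y_k), y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\rangle + \int_0^1 \left\langle y_{k+1}^{(i_k)} - y_k^{(i_k)}, \nabla_{i_k} g(y_k + t(y_{k+1} - y_k)) - \nabla_{i_k} g(y_k) \right\rangle dt \\
&\le g(y_k) + \left\langle \nabla_{i_k} g(y_k), y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\rangle + \int_0^1 \left\| y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\| \left\| \nabla_{i_k} g(y_k + t(y_{k+1} - y_k)) - \nabla_{i_k} g(y_k) \right\| dt \\
&\le g(y_k) + \left\langle \nabla_{i_k} g(y_k), y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\rangle + \int_0^1 L_{i_k} \left\| y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\|^2 t \, dt \\
&= g(y_k) + \left\langle \nabla_{i_k} g(y_k), y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\rangle + \frac{L_{i_k}}{2} \left\| y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\|^2
\end{aligned} \qquad (3)$$

On the third line, we have used the fact that $y_k$ and $y_{k+1}$ agree on all coordinate blocks except the $i_k$-th
block. On the fourth line, we have used the Cauchy-Schwarz inequality. On the fifth line, we have used
inequality (1).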
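Let $y^* = \arg\min_{y \in E} \|y - y_k\|$ be the optimal solution that is closest to $y_k$. We have

$$\begin{aligned}
\|y_{k+1} - y^*\|^2 &= \|y_k - y^*\|^2 + \|y_{k+1} - y_k\|^2 + 2 \langle y_k - y^*, y_{k+1} - y_k \rangle \\
&= \|y_k - y^*\|^2 - \|y_{k+1} - y_k\|^2 + 2 \langle y_{k+1} - y^*, y_{k+1} - y_k \rangle \\
&= \|y_k - y^*\|^2 - \left\| y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\|^2 + 2 \left\langle y_{k+1}^{(i_k)} - (y^*)^{(i_k)}, y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\rangle \\
&\le \|y_k - y^*\|^2 - \left\| y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\|^2 + \frac{2}{L_{i_k}} \left\langle \nabla_{i_k} g(y_k), (y^*)^{(i_k)} - y_{k+1}^{(i_k)} \right\rangle \\
&= \|y_k - y^*\|^2 + \frac{2}{L_{i_k}} \left\langle \nabla_{i_k} g(y_k), (y^*)^{(i_k)} - y_k^{(i_k)} \right\rangle - \frac{2}{L_{i_k}} \left( \frac{L_{i_k}}{2} \left\| y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\|^2 + \left\langle \nabla_{i_k} g(y_k), y_{k+1}^{(i_k)} - y_k^{(i_k)} \right\rangle \right) \\
&\le \|y_k - y^*\|^2 + \frac{2}{L_{i_k}} \left\langle \nabla_{i_k} g(y_k), (y^*)^{(i_k)} - y_k^{(i_k)} \right\rangle - \frac{2}{L_{i_k}} \left( g(y_{k+1}) - g(y_k) \right)
\end{aligned} \qquad (4)$$

On the third line, we have used the fact that $y_k$ and $y_{k+1}$ agree on all coordinate blocks except the $i_k$-th
block. On the fourth line, we have used the inequality (2) with $z = (y^*)^{(i_k)}$. On the last line, we have used
inequality (3).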

If we rearrange the terms of the inequality (4), take expectation over $i_k$, and substitute $L_{i_k} = 2$, we obtain

$$\mathbb{E}_{i_k} \left[ \|y_{k+1} - y^*\|^2 + g(y_{k+1}) - g(y^*) \right] \le \|y_k - y^*\|^2 + g(y_k) - g(y^*) + \frac{1}{r} \langle \nabla g(y_k), y^* - y_k \rangle \qquad (5)$$


We can upper bound $\langle \nabla g(y_k), y^* - y_k \rangle$ as follows.

$$\begin{aligned}
\langle \nabla g(y_k), y^* - y_k \rangle &= 2r \langle S^T S y_k, y^* - y_k \rangle \\
&= r \langle S^T S y_k + S^T S y^*, y^* - y_k \rangle + r \langle S^T S y_k - S^T S y^*, y^* - y_k \rangle \\
&= r \langle S^T S y_k + S^T S y^*, y^* - y_k \rangle - r \|S(y_k - y^*)\|^2 \\
&= r \langle S(y_k + y^*), S(y^* - y_k) \rangle - r \|S(y_k - y^*)\|^2 \\
&= (g(y^*) - g(y_k)) - r \|S(y_k - y^*)\|^2 \\
&\le (g(y^*) - g(y_k)) - \frac{1}{n^2 r} \|y_k - y^*\|^2
\end{aligned} \qquad (6)$$

On the first and fifth lines, we have used the fact that $\nabla g(z) = 2r S^T S z$ and $g(z) = r \|Sz\|^2$ for any $z \in \mathbb{R}^{nr}$.
On the last line, we have used Theorem 2.

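Since $y^*$ is an optimal solution to (Prox-DSM), the first-order optimality condition gives us that

$$\langle \nabla g(y^*), y^* - y_k \rangle = 2r \langle S^T S y^*, y^* - y_k \rangle \le 0 \qquad (7)$$

Using the inequality above, we can also upper bound $\langle \nabla g(y_k), y^* - y_k \rangle$ as follows.

$$\begin{aligned}
\langle \nabla g(y_k), y^* - y_k \rangle &= 2r \langle S^T S y_k, y^* - y_k \rangle \\
&= 2r \langle S^T S y^*, y^* - y_k \rangle + 2r \langle S^T S y_k - S^T S y^*, y^* - y_k \rangle \\
&= 2r \langle S^T S y^*, y^* - y_k \rangle - 2r \|S(y_k - y^*)\|^2 \\
&\le -2r \|S(y_k - y^*)\|^2 \\
&\le -\frac{2}{n^2 r} \|y_k - y^*\|^2 \qquad \text{(By Theorem 2)}
\end{aligned} \qquad (8)$$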
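APPROX algorithm applied to (Prox-DSM)

Start with $z_0 = (z_0^{(1)}, \dots, z_0^{(r)}) \in \mathcal{Y}$
$\theta_0 \leftarrow \frac{1}{r}$, $u_0 \leftarrow 0$
In each iteration $k$ ($k \ge 0$)
    Generate a random set of blocks $R_k$ where each block
    is included independently with probability $\frac{1}{r}$
    $u_{k+1} \leftarrow u_k$, $z_{k+1} \leftarrow z_k$
    For each $i \in R_k$
        $t_k^{(i)} \leftarrow \arg\min_{t \,:\, t + z_k^{(i)} \in B(F_i)} \left( \langle \nabla_i g(\theta_k^2 u_k + z_k), t \rangle + 2 r \theta_k \|t\|^2 \right)$
        $z_{k+1}^{(i)} \leftarrow z_k^{(i)} + t_k^{(i)}$
        $u_{k+1}^{(i)} \leftarrow u_k^{(i)} - \frac{1 - r \theta_k}{\theta_k^2} t_k^{(i)}$
    $\theta_{k+1} = \frac{\sqrt{\theta_k^4 + 4 \theta_k^2} - \theta_k^2}{2}$
Return $\theta_k^2 u_{k+1} + z_{k+1}$

Figure 2: The APPROX algorithm of [2] applied to (Prox-DSM). It finds a solution to (Prox-DSM) given
access to an oracle for $\min_{y \in B(F_i)} \left( \langle y, a \rangle + \|y\|^2 \right)$.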


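By taking $\frac{2}{n^2 r + 1} \times$ (6) $+ \left( 1 - \frac{2}{n^2 r + 1} \right) \times$ (8), we obtain

$$\langle \nabla g(y_k), y^* - y_k \rangle \le -\frac{2}{n^2 r + 1} \left( g(y_k) - g(y^*) + \|y_k - y^*\|^2 \right) \qquad (9)$$

By (5) and (9),

$$\mathbb{E}_{i_k} \left[ \|y_{k+1} - y^*\|^2 + g(y_{k+1}) - g(y^*) \right] \le \left( 1 - \frac{2}{n^2 r^2 + r} \right) \left( g(y_k) - g(y^*) + \|y_k - y^*\|^2 \right)$$

Note that $d(y_{k+1}, E)^2 \le \|y_{k+1} - y^*\|^2$ and $d(y_k, E)^2 = \|y_k - y^*\|^2$. Therefore

$$\mathbb{E}_{i_k} \left[ d(y_{k+1}, E)^2 + g(y_{k+1}) - g(y^*) \right] \le \left( 1 - \frac{2}{n^2 r^2 + r} \right) \left( d(y_k, E)^2 + g(y_k) - g(y^*) \right)$$

By taking expectation over $\xi = (i_1, \dots, i_k)$, we get

$$\mathbb{E}_{\xi} \left[ d(y_{k+1}, E)^2 + g(y_{k+1}) - g(y^*) \right] \le \left( 1 - \frac{2}{n^2 r^2 + r} \right)^{k+1} \left( d(y_0, E)^2 + g(y_0) - g(y^*) \right)$$

Therefore the proof of Theorem 4 is complete.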


3 Accelerated Coordinate Descent Algorithm 

In this section, we give an accelerated random coordinate descent (ACDM) algorithm for (Prox-DSM). The
algorithm uses the APPROX algorithm of Fercoq and Richtárik [2] as a subroutine. The APPROX algorithm
(Algorithm 2 in [2]), when applied to the (Prox-DSM) problem, yields the algorithm in Figure 2. The ACDM
algorithm runs in a sequence of epochs (see Figure 3). In each epoch, the algorithm starts with the solution
of the previous epoch and it runs the APPROX algorithm for $\Theta(n r^{3/2})$ iterations. The solution constructed
by the APPROX algorithm will be the starting point of the next epoch. Note that, for each $i$, the gradient
$\nabla_i g(y) = 2 \sum_{j=1}^r y^{(j)}$ can be easily maintained at a cost of $O(n)$ per block update, and thus the iteration cost
is dominated by the time to compute the projection.

ACDM Algorithm for (Prox-DSM)

⟨⟨We can take the initial point $y_0$ to be 0⟩⟩
Start with $y_0 = (y_0^{(1)}, \dots, y_0^{(r)}) \in \mathcal{Y}$
In each epoch $\ell$ ($\ell \ge 0$)
    Run the algorithm in Figure 2 for $(4 n r^{3/2} + 1)$
    iterations with $y_\ell$ as its starting point ($z_0 = y_\ell$)
    Let $y_{\ell+1}$ be the vector returned by the algorithm

Figure 3: Accelerated block coordinate descent method for (Prox-DSM). It finds a solution to (Prox-DSM)
given access to an oracle for $\min_{y \in B(F_i)} \left( \langle y, a \rangle + \|y\|^2 \right)$.

In the remainder of this section, we use the analysis of [2] together with Theorem 2 in order to show that
the ACDM algorithm converges at a linear rate. We follow the notation used in Section 2.
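The epoch structure can be sketched as follows on a toy instance. This is a hedged illustration rather than the authors' code: it assumes Euclidean projection oracles, the objective $g(y) = \|\sum_i y^{(i)}\|^2$, the APPROX updates as listed in Figure 2, an epoch length of $\lceil 4 n r^{3/2} + 1 \rceil$, and a hypothetical decomposition into one modular function and two single-edge cut functions. All function names are illustrative.

```python
import math
import random

def project_point(c):
    """Oracle for a modular function: B(F) is the single point c."""
    return lambda a: list(c)

def project_edge(u, v, w, n):
    """Oracle for the cut function of one undirected edge {u, v} of weight w."""
    def proj(a):
        y = [0.0] * n
        y[u] = max(-w, min(w, (a[u] - a[v]) / 2.0))
        y[v] = -y[u]
        return y
    return proj

def approx(oracles, z0, iters, rng):
    """One run of the APPROX scheme of Figure 2, started from z0."""
    r, n = len(oracles), len(z0[0])
    z = [list(b) for b in z0]
    u = [[0.0] * n for _ in range(r)]
    theta = used = 1.0 / r
    for _ in range(iters):
        picked = [i for i in range(r) if rng.random() < 1.0 / r]
        # Sum of the blocks of theta^2 u + z; grad_i g at that point is 2 * s.
        s = [sum(theta ** 2 * u[j][v] + z[j][v] for j in range(r)) for v in range(n)]
        for i in picked:
            # Minimizing <2s, t> + 2 r theta ||t||^2 over z_i + t in B(F_i)
            # amounts to projecting z_i - s / (2 r theta) onto B(F_i).
            new = oracles[i]([z[i][v] - s[v] / (2.0 * r * theta) for v in range(n)])
            t = [new[v] - z[i][v] for v in range(n)]
            z[i] = new
            u[i] = [u[i][v] - (1.0 - r * theta) / theta ** 2 * t[v] for v in range(n)]
        used = theta
        theta = (math.sqrt(theta ** 4 + 4.0 * theta ** 2) - theta ** 2) / 2.0
    return [[used ** 2 * u[i][v] + z[i][v] for v in range(n)] for i in range(r)]

def acdm(oracles, n, epochs, seed=0):
    """Restart APPROX every ceil(4 n r^{3/2} + 1) iterations."""
    rng = random.Random(seed)
    r = len(oracles)
    y = [proj([0.0] * n) for proj in oracles]      # a feasible starting point
    epoch_len = math.ceil(4 * n * r ** 1.5 + 1)
    for _ in range(epochs):
        y = approx(oracles, y, epoch_len, rng)
    return y

n = 3
oracles = [project_point([-2.0, 1.0, 0.5]),   # hypothetical unary costs
           project_edge(0, 1, 1.0, n),
           project_edge(1, 2, 1.0, n)]
y = acdm(oracles, n, epochs=30)
total = [sum(b[v] for b in y) for v in range(n)]
g_value = sum(t * t for t in total)
minimizer = {v for v in range(n) if -total[v] > 0}
```

The sketch recomputes the block sum in every iteration for clarity; an implementation that cares about cost would maintain it incrementally at $O(n)$ per block update, as noted above.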

Theorem 6. After $\ell$ epochs of the ACDM algorithm (equivalently, $(4 n r^{3/2} + 1)\ell$ iterations), we have

$$\mathbb{E} \left[ g(y_{\ell+1}) - g(y^*) \right] \le \frac{1}{2^{\ell+1}} \left( g(y_0) - g(y^*) \right)$$

In the following lemma, we show that the objective function of (Prox-DSM) satisfies Assumption 1 in [2],
and thus the convergence analysis given in [2] can be applied to our setting.

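Lemma 7. Let $R \subseteq \{1, 2, \dots, r\}$ be a random subset of coordinate blocks with the property that each $i \in \{1, 2, \dots, r\}$
is in $R$ independently at random with probability $1/r$. Let $x$ and $h$ be two vectors in $\mathbb{R}^{nr}$. Let
$h_R$ be the vector in $\mathbb{R}^{nr}$ such that $(h_R)^{(i)} = h^{(i)}$ for each block $i \in R$ and $(h_R)^{(i)} = 0$ otherwise. We have

$$\mathbb{E} \left[ g(x + h_R) \right] \le g(x) + \frac{1}{r} \langle \nabla g(x), h \rangle + \frac{2}{r} \|h\|^2.$$

Proof: We have

$$\begin{aligned}
\mathbb{E} \left[ g(x + h_R) \right] &= \mathbb{E} \left[ r \|S(x + h_R)\|^2 \right] \\
&= \mathbb{E} \left[ r \|Sx\|^2 + r \|S h_R\|^2 + 2r \langle Sx, S h_R \rangle \right] \\
&= \mathbb{E} \left[ r \|Sx\|^2 + r \|S h_R\|^2 + 2r \langle S^T S x, h_R \rangle \right] \\
&= \mathbb{E} \left[ g(x) + \left\| \sum_{i=1}^r (h_R)^{(i)} \right\|^2 + \langle \nabla g(x), h_R \rangle \right] \\
&= g(x) + \frac{1}{r^2} \sum_{i \ne j} \langle h^{(i)}, h^{(j)} \rangle + \frac{1}{r} \sum_{i=1}^r \|h^{(i)}\|^2 + \frac{1}{r} \langle \nabla g(x), h \rangle \\
&\le g(x) + \frac{1}{2r^2} \sum_{i \ne j} \left( \|h^{(i)}\|^2 + \|h^{(j)}\|^2 \right) + \frac{1}{r} \sum_{i=1}^r \|h^{(i)}\|^2 + \frac{1}{r} \langle \nabla g(x), h \rangle \\
&\le g(x) + \frac{2}{r} \sum_{i=1}^r \|h^{(i)}\|^2 + \frac{1}{r} \langle \nabla g(x), h \rangle \\
&= g(x) + \frac{2}{r} \|h\|^2 + \frac{1}{r} \langle \nabla g(x), h \rangle
\end{aligned}$$

□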

Lemma 7 together with Theorem 3 in [2] gives us the following theorem.

Theorem 8 (Theorem 3 of [2]). Consider iteration $k$ of the APPROX algorithm (see Figure 2). Let $y_k =
\theta_k^2 u_{k+1} + z_{k+1}$. Let $y^* = \arg\min_{y \in E} \|y - y_k\|$ be the optimal solution that is closest to $y_k$. We have

$$\mathbb{E} \left[ g(y_k) - g(y^*) \right] \le \frac{4 r^2}{(k - 1 + 2r)^2} \left( \left( 1 - \frac{1}{r} \right) \left( g(z_0) - g(y^*) \right) + 2 \|z_0 - y^*\|^2 \right)$$

Proof: It follows from Lemma 7 that the objective function $g$ of (Prox-DSM) and the random blocks $R_k$
used by the APPROX algorithm satisfy Assumption 1 in [2] with $\tau = 1$ and $v_i = 4$ for each $i \in \{1, 2, \dots, r\}$.
Thus we can apply Theorem 3 in [2]. □

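Consider an epoch $\ell$. Let $y_{\ell+1}$ be the solution constructed by the APPROX algorithm after $4 n r^{3/2} + 1$
iterations, starting with $y_\ell$. Let $y^* = \arg\min_{y \in E} \|y - y_{\ell+1}\|$ be the optimal solution that is closest to $y_{\ell+1}$.
Let $\xi_\ell$ denote the random choices made during epoch $\ell$. By Theorem 8,

$$\begin{aligned}
\mathbb{E}_{\xi_\ell} \left[ g(y_{\ell+1}) - g(y^*) \right] &\le \frac{4 r^2}{(4 n r^{3/2} + 2r)^2} \left( \left( 1 - \frac{1}{r} \right) \left( g(y_\ell) - g(y^*) \right) + 2 \|y_\ell - y^*\|^2 \right) \\
&= \frac{1}{(2 n r^{1/2} + 1)^2} \left( \left( 1 - \frac{1}{r} \right) \left( g(y_\ell) - g(y^*) \right) + 2 \|y_\ell - y^*\|^2 \right)
\end{aligned}$$

We also have

$$\begin{aligned}
g(y_\ell) &= g(y^*) + \langle \nabla g(y^*), y_\ell - y^* \rangle + \int_0^1 \langle \nabla g(y^* + t(y_\ell - y^*)) - \nabla g(y^*), y_\ell - y^* \rangle \, dt \\
&\ge g(y^*) + \int_0^1 \langle \nabla g(y^* + t(y_\ell - y^*)) - \nabla g(y^*), y_\ell - y^* \rangle \, dt \\
&= g(y^*) + \int_0^1 2 t r \|S(y_\ell - y^*)\|^2 \, dt \\
&= g(y^*) + r \|S(y_\ell - y^*)\|^2 \\
&\ge g(y^*) + \frac{1}{n^2 r} \|y_\ell - y^*\|^2
\end{aligned}$$

In the second line, we have used the first-order optimality condition for $y^*$ (Lemma 5). In the last line, we
have used Theorem 2.

Therefore

$$\|y_\ell - y^*\|^2 \le n^2 r \left( g(y_\ell) - g(y^*) \right)$$

and hence

$$\begin{aligned}
\mathbb{E}_{\xi_\ell} \left[ g(y_{\ell+1}) - g(y^*) \right] &\le \frac{2 n^2 r + 1}{(2 n r^{1/2} + 1)^2} \left( g(y_\ell) - g(y^*) \right) \\
&\le \frac{1}{2} \left( g(y_\ell) - g(y^*) \right)
\end{aligned}$$

Let $\xi = (\xi_0, \dots, \xi_\ell)$ be the random choices made during the epochs 0 to $\ell$. We have

$$\mathbb{E}_{\xi} \left[ g(y_{\ell+1}) - g(y^*) \right] \le \frac{1}{2^{\ell+1}} \left( g(y_0) - g(y^*) \right)$$

This completes the proof of Theorem 6 and the convergence analysis for the ACDM algorithm.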
[Figure 4 panels: (a) Penguin; (b) ACDM 1: $\nu_s = 1.28 \cdot 10^7$, $\nu_d = 1.3 \cdot 10^5$; (c) ACDM 20: $\nu_s = 8.38 \cdot 10^6$, $\nu_d = 8.14 \cdot 10^4$; (d) ACDM 100: $\nu_d = 1.5 \cdot 10^4$; (e) AP 1: $\nu_d = 1.06 \cdot 10^5$; (f) AP 20: $\nu_d = 1.05 \cdot 10^5$; (g) AP 100: $\nu_s = 7.64 \cdot 10^6$, $\nu_d = 8.3 \cdot 10^4$]

Figure 4: Penguin segmentation results for the fastest (ACDM) and slowest (AP) algorithms, after 1, 20,
and 100 projections. The $\nu_s$ and $\nu_d$ values are the smooth and discrete dual gaps.

4 Experiments 

Algorithms. We empirically evaluate and compare the following algorithms: the RCDM described in
Section 2, the ACDM described in Section 3, and the alternating projections (AP) algorithm of [12]. The
AP algorithm solves the following best approximation problem that is equivalent to (Prox-DSM):

$$\min_{a \in \mathcal{A},\, y \in \mathcal{Y}} \|a - y\|^2 \qquad \text{(Best-Approx)}$$

where $\mathcal{A} = \left\{ (a^{(1)}, a^{(2)}, \dots, a^{(r)}) \in \mathbb{R}^{nr} : \sum_{i=1}^r a^{(i)} = 0 \right\}$ and $\mathcal{Y} = \bigotimes_{i=1}^r B(F_i)$.

The AP algorithm starts with a point $a_0 \in \mathcal{A}$ and it iteratively constructs a sequence $\{(a_k, y_k)\}_{k \ge 0}$ by
projecting onto $\mathcal{A}$ and $\mathcal{Y}$: $y_k = \Pi_{\mathcal{Y}}(a_k)$, $a_{k+1} = \Pi_{\mathcal{A}}(y_k)$.

$\Pi_K(\cdot)$ is the projection operator onto $K$, that is, $\Pi_K(x) = \arg\min_{z \in K} \|x - z\|$. Since $\mathcal{A}$ is a subspace, it
is straightforward to project onto $\mathcal{A}$. The projection onto $\mathcal{Y}$ can be implemented using the oracles for the
projections onto the base polytopes of the functions $F_i$.
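The two projections can be sketched in a few lines. The code below is an illustration under assumptions, not the implementation evaluated in the paper: projecting onto the subspace $\mathcal{A}$ subtracts the block average from every block, projecting onto $\mathcal{Y}$ applies one projection oracle per block, and the oracles are the same hypothetical toy oracles (one modular function, two single-edge cuts) with illustrative names.

```python
def project_point(c):
    """Oracle for a modular function: B(F) is the single point c."""
    return lambda a: list(c)

def project_edge(u, v, w, n):
    """Oracle for the cut function of one undirected edge {u, v} of weight w."""
    def proj(a):
        y = [0.0] * n
        y[u] = max(-w, min(w, (a[u] - a[v]) / 2.0))
        y[v] = -y[u]
        return y
    return proj

def alternating_projections(oracles, n, iters):
    """AP for (Best-Approx); uses r projections per iteration, iters >= 1."""
    r = len(oracles)
    a = [[0.0] * n for _ in range(r)]                  # a_0 = 0 lies in the subspace A
    y = a
    for _ in range(iters):
        y = [oracles[i](a[i]) for i in range(r)]       # y_k = Pi_Y(a_k), block by block
        mean = [sum(b[v] for b in y) / r for v in range(n)]
        a = [[b[v] - mean[v] for v in range(n)] for b in y]   # a_{k+1} = Pi_A(y_k)
    return y

n = 3
oracles = [project_point([-2.0, 1.0, 0.5]),   # hypothetical unary costs
           project_edge(0, 1, 1.0, n),
           project_edge(1, 2, 1.0, n)]
y = alternating_projections(oracles, n, iters=200)
total = [sum(b[v] for b in y) for v in range(n)]
g_value = sum(t * t for t in total)
```

Each AP iteration calls every oracle once, which is why the comparison in this section counts projections rather than iterations.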

For all three algorithms, the iteration cost is dominated by the cost of projecting onto the base polytopes
$B(F_i)$. Therefore the total number of such projections is a suitable measure for comparing the algorithms.
In each iteration, the RCDM algorithm performs a single projection for a random block $i$ and the ACDM
algorithm performs a single projection in expectation. The AP algorithm performs $r$ projections in each
iteration, one for each block.

[Figure 5 panels, each plotted against the number of projections: (a) Smooth gaps - Octopus; (b) Smooth gaps - Penguin; (c) Smooth gaps - Plane; (d) Smooth gaps - Small plant; (e) Discrete gaps - Octopus; (f) Discrete gaps - Penguin; (g) Discrete gaps - Plane; (h) Discrete gaps - Small plant]

Figure 5: Comparison of the convergence of the three algorithms (UCDM, ACDM, AP) on four image
segmentation instances.


Image Segmentation Experiments. We evaluate the algorithms on graph cut problems that arise in
image segmentation or MAP inference tasks in Markov random fields. Our experimental setup is similar
to that of [6]. We set up the image segmentation problems on an 8-neighbor grid graph with unary potentials
derived from Gaussian mixture models of color features. The weight of a graph edge $(i, j)$ between pixels
$i$ and $j$ is a function of $\exp(-\|v_i - v_j\|^2)$, where $v_i$ is the RGB color vector of pixel $i$. The optimization
problem that we solve for each segmentation task is a cut problem on the grid graph.

Function decomposition: We partition the edges of the grid into a small number of matchings and we 
decompose the function using the cut functions of these matchings. Note that it is straightforward to project 
onto the base polytopes of such functions using a sequence of projections onto line segments. 
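For a matching, the projection separates over the edges, since the edges share no endpoints. The sketch below assumes undirected edges with nonnegative weights and treats unmatched vertices as fixed at zero; the function name and the sample input are illustrative, not taken from the paper.

```python
def project_matching_cut(edges, a):
    """Project a onto B(F), where F is the cut function of a matching.

    edges: list of (u, v, w) with pairwise disjoint endpoints and w >= 0.
    For each edge the feasible points are (t, -t) with |t| <= w, so the
    projection clips (a_u - a_v) / 2 to [-w, w]; unmatched coordinates are 0."""
    endpoints = [p for (u, v, _) in edges for p in (u, v)]
    if len(endpoints) != len(set(endpoints)):
        raise ValueError("edges must form a matching")
    y = [0.0] * len(a)
    for (u, v, w) in edges:
        t = max(-w, min(w, (a[u] - a[v]) / 2.0))
        y[u], y[v] = t, -t
    return y

# Hypothetical input: two disjoint edges on five vertices.
y = project_matching_cut([(0, 1, 1.0), (2, 3, 0.5)], [3.0, 0.0, 0.2, 0.0, 7.0])
```

In this example the first edge is clipped at its weight while the second stays in the interior of its segment.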


Duality gaps: We evaluate the convergence behaviour of the algorithms using the following measures. Let
$y$ be a feasible solution to the dual of the proximal problem (Proximal). The solution $x = -\sum_{i=1}^r y^{(i)}$ is a
feasible solution for the proximal problem. We define the smooth duality gap to be the difference between
the objective values of the primal solution $x$ and the dual solution $y$: $\nu_s = \left( f(x) + \frac{1}{2} \|x\|^2 \right) - \left( -\frac{r}{2} \|Sy\|^2 \right)$,
where $-\frac{r}{2} \|Sy\|^2 = -\frac{1}{2} \left\| \sum_{i=1}^r y^{(i)} \right\|^2$ is the dual objective from Lemma 1.

Additionally, we compute a discrete duality gap for the discrete problem (DSM) and the dual of its Lovász
relaxation; the latter is the problem $\max_{z \in B(F)} (z)_-(V)$, where $(z)_- = \min\{z, 0\}$ applied elementwise [6].
The best level set $S_x$ of the proximal solution $x = -\sum_{i=1}^r y^{(i)}$ is a solution to the discrete problem (DSM).
The solution $z = -x = \sum_{i=1}^r y^{(i)}$ is a feasible solution for the dual of the Lovász relaxation. We define the
discrete duality gap to be the difference between the objective values of these solutions: $\nu_d(x) = F(S_x) -
(-x)_-(V)$.
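Both gaps can be computed with one greedy pass over the sorted coordinates of $x$. The sketch below is illustrative: the function names, the toy function `F_toy`, and the two sets of dual blocks are assumptions made for the example, the smooth gap uses the dual objective $-\frac{1}{2}\|\sum_i y^{(i)}\|^2$, and the discrete gap uses $z = \sum_i y^{(i)}$ together with the best prefix set of the sorted order.

```python
def lovasz_and_best_prefix(F, x):
    """Greedy pass over elements sorted by decreasing x.
    Returns f(x) and the smallest value of F over the prefix sets (incl. the empty set)."""
    order = sorted(range(len(x)), key=lambda v: -x[v])
    prefix, prev, fx = set(), F(frozenset()), 0.0
    best_val = prev
    for v in order:
        prefix.add(v)
        cur = F(frozenset(prefix))
        fx += x[v] * (cur - prev)
        best_val = min(best_val, cur)
        prev = cur
    return fx, best_val

def duality_gaps(F, blocks):
    """Return (smooth gap, discrete gap) for dual blocks y^(1), ..., y^(r)."""
    n = len(blocks[0])
    z = [sum(y[v] for y in blocks) for v in range(n)]   # z = sum_i y^(i) = -x
    x = [-t for t in z]
    fx, best_val = lovasz_and_best_prefix(F, x)
    sq = sum(t * t for t in z)                          # ||x||^2 = ||sum_i y^(i)||^2
    smooth = (fx + 0.5 * sq) - (-0.5 * sq)
    discrete = best_val - sum(min(t, 0.0) for t in z)
    return smooth, discrete

def F_toy(A):
    """Hypothetical toy function: modular costs plus the cut of the path 0 - 1 - 2."""
    c = [-2.0, 1.0, 0.5]
    edges = [(0, 1, 1.0), (1, 2, 1.0)]
    return sum(c[v] for v in A) + sum(wt for (u, v, wt) in edges if (u in A) != (v in A))

start = [[-2.0, 1.0, 0.5], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]       # a feasible point
optimal = [[-2.0, 1.0, 0.5], [1.0, -1.0, 0.0], [0.0, 0.25, -0.25]]  # optimum, by hand
gaps_start = duality_gaps(F_toy, start)
gaps_optimal = duality_gaps(F_toy, optimal)
```

Both gaps are nonnegative for any feasible dual point and vanish at the hand-computed optimum of the toy instance.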


We evaluated the algorithms on four image segmentation instances [7] [16]. Figure 5 shows the smooth
and discrete duality gaps on the four instances. Figure 4 shows some segmentation results for one of the
instances.
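A Proof of Lemma 1

By the definition of the Lovász extension, for each $i \in [r]$, we have

$$f_i(x) = \max_{y^{(i)} \in B(F_i)} \langle y^{(i)}, x \rangle.$$

Therefore

$$\begin{aligned}
\min_{x \in \mathbb{R}^n} \sum_{i=1}^r \left( f_i(x) + \frac{1}{2r} \|x\|^2 \right)
&= \min_{x \in \mathbb{R}^n} \sum_{i=1}^r \left( \max_{y^{(i)} \in B(F_i)} \langle y^{(i)}, x \rangle + \frac{1}{2r} \|x\|^2 \right) \\
&= \min_{x \in \mathbb{R}^n} \max_{y^{(1)} \in B(F_1), \dots, y^{(r)} \in B(F_r)} \sum_{i=1}^r \left( \langle y^{(i)}, x \rangle + \frac{1}{2r} \|x\|^2 \right) \\
&= \max_{y^{(1)} \in B(F_1), \dots, y^{(r)} \in B(F_r)} \min_{x \in \mathbb{R}^n} \sum_{i=1}^r \left( \langle y^{(i)}, x \rangle + \frac{1}{2r} \|x\|^2 \right) \\
&= \max_{y^{(1)} \in B(F_1), \dots, y^{(r)} \in B(F_r)} -\frac{1}{2} \left\| \sum_{i=1}^r y^{(i)} \right\|^2
\end{aligned}$$

On the third line, we have used the fact that the function $\langle y, x \rangle + (1/2r) \|x\|^2$ is convex in $x$ and linear in
$y$, which allows us to exchange the min and the max (see for example Corollary 37.3.2 in Rockafellar [15]).
On the fourth line, we have used the fact that the minimum is achieved at $x = -\sum_{i=1}^r y^{(i)}$.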


B Proofs omitted from Section 2

If $x \in \mathbb{R}^{nr}$ and $X$ is a subspace of $\mathbb{R}^{nr}$, we let $\Pi_X(x)$ denote the projection of $x$ on $X$, that is, $\Pi_X(x) =
\arg\min_{z \in X} \|x - z\|$. We let $X^\perp$ denote the orthogonal complement of the subspace $X$.

Proposition 9. For any point $x \in \mathbb{R}^{nr}$, $\Pi_{\mathcal{Q}^\perp}(x) = S^T S x$ and thus $\Pi_{\mathcal{Q}}(x) = x - S^T S x$.

Proof: Since $\mathcal{Q}$ is the null space of $S$, $\mathcal{Q}^\perp$ is the row space of $S$. Since the rows of $S$ are orthonormal, they
form a basis for $\mathcal{Q}^\perp$. Therefore, if we let $v_1, \dots, v_n$ denote the rows of $S$, we have

$$\Pi_{\mathcal{Q}^\perp}(x) = \sum_{i=1}^n \langle x, v_i \rangle v_i = S^T S x.$$

□

Proposition 10. The set of all optimal solutions to (Prox-DSM) is equal to $E$.
Proof: We have

$$\begin{aligned}
d(\mathcal{P}, \mathcal{Q}) &= \min_{y \in \mathcal{P}} \|y - \Pi_{\mathcal{Q}}(y)\| \\
&= \min_{y \in \mathcal{P}} \|S^T S y\| \qquad \text{(By Proposition 9)} \\
&= \min_{y \in \mathcal{P}} \|Sy\|
\end{aligned}$$

Since (Prox-DSM) is the problem $\min_{y \in \mathcal{P}} r \|Sy\|^2$, $E$ is the set of all optimal solutions to (Prox-DSM).

□

Proposition 11. Let $y \in \mathbb{R}^{nr}$ and let $p \in E$. We have $d(y, \mathcal{Q}') = \|S(y - p)\|$.
Proof: Since $\mathcal{Q}' = \mathcal{Q} - v$, we have

$$\begin{aligned}
d(y, \mathcal{Q}') &= d(y + v, \mathcal{Q}) \\
&= \|\Pi_{\mathcal{Q}^\perp}(y + v)\| \\
&= \|S^T S (y + v)\| \qquad \text{(By Proposition 9)} \\
&= \|S^T S (y - S^T S p)\| \qquad \text{(Since } v = -S^T S p \text{)} \\
&= \|S^T S (y - p)\| \qquad \text{(Since } S S^T = I_n \text{)} \\
&= \|S(y - p)\|
\end{aligned}$$

